{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2WSnqj3dtf4F"
   },
   "source": [
    "<center>\n",
    "    <p style=\"text-align:center\">\n",
    "        <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n",
    "        <br>\n",
    "        <a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n",
    "        |\n",
    "        <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
    "        |\n",
    "        <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n",
    "    </p>\n",
    "</center>\n",
    "<h1 align=\"center\">Detecting Fraud with Tabular Embeddings</h1>\n",
    "\n",
    "Imagine you maintain a fraud-detection service for your e-commerce company. In the past few weeks, there's been an alarming spike in undetected cases of fraudulent credit card transactions. These false negatives are hurting your bottom line, and you've been tasked with solving the issue.\n",
    "\n",
    "Phoenix provides opinionated workflows to surface feature drift and data quality issues quickly so you can get straight to the root-cause of the problem. As you'll see, your fraud-detection service is receiving more and more traffic from an untrustworthy merchant, causing your model's false negative rate to skyrocket.\n",
    "\n",
    "In this tutorial, you will:\n",
    "* Download curated datasets of credit card transaction and fraud-detection data\n",
    "* Compute tabular embeddings to represent each transaction\n",
    "* Pinpoint fraudulent transactions from a suspicious merchant\n",
    "* Export data from this merchant to retrain your model\n",
    "\n",
    "Let's get started!\n",
    "\n",
    "## Install Dependencies and Import Libraries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2WSnqj3dtf4F"
   },
   "source": [
    "Install Phoenix and the Arize SDK, which provides convenience methods for extracting embeddings for tabular data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -Uq \"arize-phoenix[embeddings]\" \"arize[AutoEmbeddings]\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import dependencies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from arize.pandas.embeddings.tabular_generators import EmbeddingGeneratorForTabularFeatures\n",
    "from sklearn.metrics import recall_score\n",
    "\n",
    "import phoenix as px"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2ux2rILWtf4I"
   },
   "source": [
    "## Download the Data\n",
    "\n",
    "Download your training and production data from a fraud detection model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df = pd.read_parquet(\n",
    "    \"http://storage.googleapis.com/arize-phoenix-assets/datasets/structured/credit-card-fraud/credit_card_fraud_train.parquet\",\n",
    ")\n",
    "prod_df = pd.read_parquet(\n",
    "    \"http://storage.googleapis.com/arize-phoenix-assets/datasets/structured/credit-card-fraud/credit_card_fraud_production.parquet\",\n",
    ")\n",
    "train_df = train_df.reset_index(\n",
    "    drop=True\n",
    ")  # recommended when using EmbeddingGeneratorForTabularFeatures\n",
    "prod_df = prod_df.reset_index(\n",
    "    drop=True\n",
    ")  # recommended when using EmbeddingGeneratorForTabularFeatures\n",
    "train_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZebmnDlutf4J"
   },
   "source": [
    "The columns of the dataframe are:\n",
    "- **prediction_id:** the unique ID for each prediction\n",
    "- **prediction_timestamp:** the timestamps of your predictions\n",
    "- **predicted_label:** the label your model predicted\n",
    "- **predicted_score:** the score of each prediction\n",
    "- **actual_label:** the true, ground-truth label for each prediction (fraud vs. not_fraud)\n",
    "- **tabular_vector:** pre-computed tabular embeddings for each row of data\n",
    "- **age:** a tag used to filter your data in the Phoenix UI\n",
    "\n",
    "The rest of the columns are features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_column_names = [\n",
    "    \"fico_score\",\n",
    "    \"loan_amount\",\n",
    "    \"term\",\n",
    "    \"interest_rate\",\n",
    "    \"installment\",\n",
    "    \"grade\",\n",
    "    \"home_ownership\",\n",
    "    \"annual_income\",\n",
    "    \"verification_status\",\n",
    "    \"pymnt_plan\",\n",
    "    \"addr_state\",\n",
    "    \"dti\",\n",
    "    \"delinq_2yrs\",\n",
    "    \"inq_last_6mths\",\n",
    "    \"mths_since_last_delinq\",\n",
    "    \"mths_since_last_record\",\n",
    "    \"open_acc\",\n",
    "    \"pub_rec\",\n",
    "    \"revol_bal\",\n",
    "    \"revol_util\",\n",
    "    \"state\",\n",
    "    \"merchant_ID\",\n",
    "    \"merchant_risk_score\",\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZebmnDlutf4J"
   },
   "source": [
    "## Compute Tabular Embeddings (Optional)\n",
    "\n",
    "⚠️ This step requires a GPU.\n",
    "\n",
    "Run the cell below to compute embeddings for your tabular data from scratch, or skip this step to use the pre-computed embeddings downloaded with the rest of your data.\n",
    "\n",
    "`EmbeddingGeneratorForTabularFeatures` represents each row of your dataframe as a piece of text and computes an embedding for that text using a pre-trained language model (in this case, \"distilbert-base-uncased\"). For example, if a row of your dataframe represents a transaction in the state of California from a merchant named \"Leannon Ward\" with a FICO score of 616 and a merchant risk score of 23, `EmbeddingGeneratorForTabularFeatures` computes an embedding for the text: \"The state is CA. The merchant ID is Leannon Ward. The fico score is 616. The merchant risk score is 23...\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "generator = EmbeddingGeneratorForTabularFeatures(\n",
    "    model_name=\"distilbert-base-uncased\",\n",
    ")\n",
    "train_df[\"tabular_vector\"] = generator.generate_embeddings(\n",
    "    train_df,\n",
    "    selected_columns=feature_column_names,\n",
    ")\n",
    "prod_df[\"tabular_vector\"] = generator.generate_embeddings(\n",
    "    prod_df,\n",
    "    selected_columns=feature_column_names,\n",
    ")"
   ]
  },
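  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the text representation described above more concrete, here is a minimal sketch of how a row could be rendered as sentences. The helper `row_to_text` and the sample row are hypothetical and written only for illustration; this is not the Arize SDK's implementation, whose exact text format may differ.\n",
    "\n",
    "```python\n",
    "def row_to_text(row, columns):\n",
    "    # Turn each selected column into a short sentence, e.g. 'The state is CA.'\n",
    "    sentences = []\n",
    "    for column in columns:\n",
    "        readable_name = column.replace('_', ' ')  # merchant_ID -> merchant ID\n",
    "        sentences.append(f'The {readable_name} is {row[column]}.')\n",
    "    return ' '.join(sentences)\n",
    "\n",
    "\n",
    "# Hypothetical row mirroring the example in the text above\n",
    "example_row = {\n",
    "    'state': 'CA',\n",
    "    'merchant_ID': 'Leannon Ward',\n",
    "    'fico_score': 616,\n",
    "    'merchant_risk_score': 23,\n",
    "}\n",
    "example_text = row_to_text(example_row, list(example_row))\n",
    "print(example_text)\n",
    "```\n",
    "\n",
    "Because the helper only indexes with `row[column]`, it works the same way on a plain dict or on a single row of `train_df`, such as `train_df.iloc[0]`."
   ]
  },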
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MW-npos-tf4J"
   },
   "source": [
    "## Launch Phoenix\n",
    "\n",
    "Define a schema to tell Phoenix what the columns of your dataframe represent (features, predictions, actuals, tags, embeddings, etc.). See the [docs](https://arize.com/docs/phoenix/) for guides on how to define your own schema and API reference on `phoenix.Schema` and `phoenix.EmbeddingColumnNames`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "schema = px.Schema(\n",
    "    prediction_id_column_name=\"prediction_id\",\n",
    "    prediction_label_column_name=\"predicted_label\",\n",
    "    prediction_score_column_name=\"predicted_score\",\n",
    "    actual_label_column_name=\"actual_label\",\n",
    "    timestamp_column_name=\"prediction_timestamp\",\n",
    "    feature_column_names=feature_column_names,\n",
    "    tag_column_names=[\"age\"],\n",
    "    embedding_feature_column_names={\n",
    "        \"tabular_embedding\": px.EmbeddingColumnNames(\n",
    "            vector_column_name=\"tabular_vector\",\n",
    "        ),\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vmM01H1Ytf4K"
   },
   "source": [
    "Create Phoenix datasets that wrap your dataframes with schemas that describe them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prod_ds = px.Inferences(dataframe=prod_df, schema=schema, name=\"production\")\n",
    "train_ds = px.Inferences(dataframe=train_df, schema=schema, name=\"training\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "y0LzPbuytf4K"
   },
   "source": [
    "Launch Phoenix. Follow the instructions in the cell output to open the Phoenix UI in the notebook or in a new browser tab."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(session := px.launch_app(primary=prod_ds, reference=train_ds)).view()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "session.view()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jFYdi3vktf4L"
   },
   "source": [
    "## Detect Feature Drift Using PSI\n",
    "\n",
    "Phoenix helps you detect feature drift that is causing your model's performance to degrade in production. In particular, it uses population stability index (PSI), a statistical measure that quantifies the difference between two probability distributions. When used to compare the training and production distributions for a feature, PSI values are often interpreted as follows:\n",
    "\n",
    "- **PSI < 0.1:** The model is stable. No significant changes have been observed between the training and production distributions.\n",
    "- **0.1 <= PSI < 0.25:** Minor drift has been observed, but these are not expected to have a large impact on the model's performance.\n",
    "- **PSI >= 0.25:** Significant change has been observed that may impact the model's performance in production.\n",
    "\n",
    "Notice that the PSI value for the `merchant_ID` feature is well above the threshold of 0.25, indicating that the production distribution of this feature has drifted significantly from the training distribution. Click on `merchant_ID` to view the dimension details for this feature. Notice that the training and production distributions for this feature are virtually identical at times when the PSI is low. However, if you select a period of time during which the PSI is high, the distributions are very different. In particular, the cardinality of the merchant ID feature increases in production to include a new merchant ID, Scammeds, not included in the training dataset. This sudden spike in transactions from a new merchant looks suspicious, so you decide to investigate further.\n",
    "\n",
    "<img src=\"http://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/credit-card-fraud-detection-tutorial/dimension_details.gif\" alt=\"find fraudulent merchant using psi\" style=\"width:100%; height:auto;\">\n",
    "\n",
    "## Pinpoint Fraudulent Merchant Data Using Tabular Embeddings\n",
    "\n",
    "Click on \"tabular_embedding\" in the \"Embeddings\" section.\n",
    "\n",
    "![click on tabular embeddings](http://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/credit-card-fraud-detection-tutorial/click_on_tabular_embedding.png)\n",
    "\n",
    "Select a period of high drift in the Euclidean distance graph at the top.\n",
    "\n",
    "![select period of high drift](http://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/credit-card-fraud-detection-tutorial/select_period_of_high_drift.png)\n",
    "\n",
    "Hover over the top clusters in the left side panel. Phoenix has identified these clusters as problematic because they consist entirely or almost entirely of production data, meaning that your model is making production inferences on data the likes of which it never saw during training.\n",
    "\n",
    "❗ Your point cloud won't look identical to the one in this picture if you computed your own embeddings.\n",
    "\n",
    "![hover over top clusters](http://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/credit-card-fraud-detection-tutorial/hover_over_top_clusters.png)\n",
    "\n",
    "In the display settings in the bottom left, select \"dimension\" in the \"Color By\" dropdown. Then select the \"merchant_ID\" feature in the \"Dimension\" dropdown. Notice that the troublesome clusters you found in the previous step are coming from the Scammeds merchant.\n",
    "\n",
    "![color by merchant id](http://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/credit-card-fraud-detection-tutorial/color_by_merchant_id.png)\n",
    "\n",
    "Next, color your data by correctness to confirm. Notice that many of the data points in the Scammeds clusters are incorrectly classified.\n",
    "\n",
    "![color by correctness](http://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/credit-card-fraud-detection-tutorial/color_by_correctness.png)\n",
    "\n",
    "Select the relevant data using the lasso and export the data for further analysis.\n",
    "\n",
    "![select points with lasso and export](http://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/credit-card-fraud-detection-tutorial/select_points_with_lasso_and_export.png)\n",
    "\n",
    "## View and Analyze Exported Data\n",
    "\n",
    "View your most recently exported data as a dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "export_df = session.exports[-1]\n",
    "export_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute the false negative rate on your exported data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "export_df = export_df[\n",
    "    export_df.actual_label != \"uncertain\"\n",
    "]  # remove rows with unknown ground-truth\n",
    "recall = recall_score(\n",
    "    y_true=export_df.actual_label == \"fraud\", y_pred=export_df.predicted_label == \"fraud\"\n",
    ")\n",
    "false_negative_rate = 1 - recall\n",
    "false_negative_rate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "siUbGmK2tf4L"
   },
   "source": [
    "That false negative rate is unacceptably high. Congrats! You've identified the fraudulent merchant causing your false negative rate to spike. As an actionable next step, you can fine-tune your model on the misclassified examples and report the fraudulent merchant to the proper authorities."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
